Have you ever liked a TV show so much that watching it wasn’t enough anymore? That you wanted to know things nobody would tell you? Read on and see how I used R to see what’s inside South Park!

So I have the idea, but where do I start?

South Park is an American TV show for adults. It is well known for being very satirical. Pretty much every famous person has already been made fun of in the series. I watch it every day! I also do lots of analyses in R every day. So I thought to myself: why haven’t I analysed South Park texts yet? What is the overall sentiment of the show? How does episode popularity evolve over time? Who is the naughtiest character? Are naughtier episodes also more popular? I’ll answer these and more questions in this article series about South Park.

But first things first. I had to find a resource with all the text in a reasonable format. It took just a bit of Googling to find a South Park gold mine! I typed “South Park scripts” into Google and the very first link was exactly what I was looking for: South Park Archives, a page with community-maintained scripts for all episodes! Isn’t that great?

You can find a list of seasons on that page. And after clicking on a season, an episode list comes up. An episode page contains a nice table with two columns. The first column is a character name. And the second column is the actual line that character said. That’s a perfect start.
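
If you are curious how a two-column table like that can end up as a data frame in R, here is a minimal sketch. It assumes the rvest package and uses a made-up inline HTML snippet instead of a real episode page; the column names and the lines are placeholders, and the real page layout may differ.

```r
library(rvest)

# Inline stand-in for an episode page (hypothetical markup, not the real site).
html <- '
<table>
  <tr><th>character</th><th>line</th></tr>
  <tr><td>Character A</td><td>First example line.</td></tr>
  <tr><td>Character B</td><td>Second example line.</td></tr>
</table>'

# Wrap the snippet in a minimal HTML document so it can be queried.
page <- minimal_html(html)

# html_table() converts the <table> node into a data frame,
# using the <th> cells as column names.
script <- html_table(html_element(page, "table"))

print(script)
```

Each row is one spoken line, which is a convenient shape for counting words per character or per episode later on.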

There was one last thing I wanted to know about each episode: its popularity! I’m sure that you know IMDB, the Internet Movie Database. It contains ratings for movies and TV shows alike.

But how to put it all together? I wrote an R package called southparkr that anyone can use to do their own analyses! The package downloads all the information described above and makes it conveniently available. It simply does the hard work for you and allows you to focus on your analyses.

Data acquired. BINGO! Let’s dig in.

The second step was to determine what exactly I wanted to analyse. In this article, I decided on doing two things:

  1. Sentiment analysis of episodes,
  2. Episode popularity based on IMDB ratings.

We’ll get to that in a minute. We should first have a look at the data we acquired. Have a look at the following table. It summarises all episodes in a few numbers.

Number of seasons: 21
Number of episodes: 287
Number of words: 907 797
Number of words without stop words (a, the, this, …): 310 759
% of words used for analysis: 34.23
Average IMDB rating: 8.14
Best episode (9.6): Scott Tenorman Must Die S05E04
Worst episode (6.3): Funnybot S15E02

You can see that the show has been on for 21 seasons already. All the characters combined have said almost 1 million words! That is if we count all words. If we exclude stop words, we end up with about 300 thousand words. Stop words are prepositions, articles and other very common words.
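
To make the idea of stop words concrete, here is a small sketch in base R. The sentence and the stop-word list are made up for illustration (a real analysis would use a much longer list), and the last line simply repeats the percentage from the table using the two word counts.

```r
# Toy line of dialogue and a tiny hand-made stop-word list (both illustrative).
line <- "This is the best day of my life and I am going to enjoy it"
stop_words <- c("this", "is", "the", "of", "my", "and", "i", "am", "to", "it")

# Lower-case the text and split it on anything that is not a letter or apostrophe.
words <- strsplit(tolower(line), "[^a-z']+")[[1]]
words <- words[words != ""]

# Keep only the words that are not in the stop-word list.
kept <- words[!words %in% stop_words]

# Share of words that survive the filter, in percent.
share_kept <- round(100 * length(kept) / length(words), 2)

# The same arithmetic applied to the totals from the summary table.
share_article <- round(100 * 310759 / 907797, 2)
```

Only the words that carry some meaning are left, and those are the ones that can be scored.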

Across all episodes, the show sustains an average rating of roughly 8.1, which is great! It seems that the show is popular. I consider anything rated above 8 very watchable! You can also see the best and the worst episode. So in case you don’t know the show, the best episode is where you might start. It is almost guaranteed that you won’t be disappointed.
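
Numbers like these are easy to get once the ratings sit in a data frame. The sketch below is only an illustration: two rows are taken from the summary table, the third is a made-up placeholder, and the column names are my own assumption.

```r
# Two rows come from the summary table; the third is a made-up placeholder.
ratings <- data.frame(
  episode = c("Scott Tenorman Must Die", "Funnybot", "Hypothetical episode"),
  code    = c("S05E04", "S15E02", NA),
  rating  = c(9.6, 6.3, 8.0)
)

# Average rating over all rows.
mean_rating <- mean(ratings$rating)

# Rows with the highest and the lowest rating.
best  <- ratings[which.max(ratings$rating), ]
worst <- ratings[which.min(ratings$rating), ]
```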

Let’s get sentimental, let’s get dirty!

We’ll tackle the first analysis now. Sentiment analysis is a type of text analysis that scores words. The scores can be positive or negative and can be expressed as numbers or as words. We will be using the AFINN dictionary, which scores words from -5 to +5, where -5 is a very negative word, 0 is neutral and +5 is very positive.

For example, “bastard” is a -5 word and “thrilled” is a +5 word!
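
Here is a rough sketch of how dictionary-based scoring works, written in base R. The lexicon is a tiny stand-in for AFINN: only the two words mentioned in this article are taken from it, the other two scores are placeholders, and `score_line()` is a hypothetical helper of mine, not a function from any package.

```r
# Tiny stand-in for the AFINN dictionary. Only "bastard" and "thrilled"
# come from the article; the other scores are illustrative placeholders.
lexicon <- c(bastard = -5, thrilled = 5, awful = -3, nice = 3)

# Hypothetical helper: mean score of the dictionary words found in a line.
score_line <- function(line, lexicon) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  scores <- lexicon[words]            # NA for words missing from the dictionary
  scores <- scores[!is.na(scores)]
  if (length(scores) == 0) return(NA_real_)
  mean(scores)
}

score_line("I am thrilled to see that bastard", lexicon)
```

Words that are not in the dictionary are simply ignored, so a line made up of unknown words gets no score at all. In a Tidyverse workflow, the full dictionary can be loaded with `get_sentiments("afinn")` from the tidytext package instead of typing it by hand.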

All of this has been prepared for you behind the curtain. You will now see a few lines of R code that plot the mean sentiment score of every episode.

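library(ggplot2)  # ggplot(), aes(), geom_col(), geom_smooth()
library(plotly)   # ggplotly()

# One bar per episode plus a smoothed trend line; text_sent feeds the tooltip
gg <- ggplot(by_episode, aes(x = episode_number, y = mean_sentiment_score, group = 1, text = text_sent)) +
    geom_col(color = "#592a88") +
    geom_smooth()

# Turn the static plot into an interactive one that shows text_sent on hover
ggplotly(gg, tooltip = "text")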

This creates an interactive plot! You can hover over the bars to see some information. Each bar is an episode, and hovering over it shows the episode name, number and sentiment score.
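
The plotting code expects a data frame called by_episode with the columns episode_number, mean_sentiment_score and text_sent. The real one is prepared behind the curtain, so the following is only a sketch of how such a table could be assembled in base R. The input table scored_words, its episode_name column and all the values are assumptions made up for this example.

```r
# Hypothetical word-level scores; names and numbers are placeholders.
scored_words <- data.frame(
  episode_number = c(1L, 1L, 1L, 2L, 2L),
  episode_name   = c("Episode A", "Episode A", "Episode A", "Episode B", "Episode B"),
  score          = c(-5, 2, -3, 3, -1)
)

# One row per episode with the mean score of its words.
by_episode <- aggregate(
  score ~ episode_number + episode_name,
  data = scored_words,
  FUN = mean
)
names(by_episode)[names(by_episode) == "score"] <- "mean_sentiment_score"

# Text for the hover tooltip, one string per episode.
by_episode$text_sent <- sprintf(
  "Episode %d: %s\nMean sentiment: %.2f",
  by_episode$episode_number,
  by_episode$episode_name,
  by_episode$mean_sentiment_score
)
```

Because the tooltip is an ordinary text column, you can put anything you like into it, for example the IMDB rating.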

It’s just a few lines of code and the result is great! And above all, it is almost like writing an English sentence. This is what R programming looks like when you use the Tidyverse suite of packages.


You can see that most of the episodes have the bar pointing down, below zero. That’s mostly because the characters aren’t afraid to use dirty words. And they do it quite a lot!

You might also notice a blue line in the plot. It shows the trend in sentiment over time. The score increased considerably from the beginning, peaked roughly around episode 80 and then started falling again. In short, the language used in the show changes over time.
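
In case you wonder where the blue line comes from: for fewer than 1000 points, geom_smooth() fits a local regression (loess) by default. The sketch below does the same thing by hand with base R. The scores are synthetic and only roughly shaped like a rise followed by a decline; they are not the real South Park data.

```r
set.seed(42)

# Synthetic scores shaped like a rise followed by a decline (not real data).
episode_number <- 1:200
mean_sentiment_score <- -0.5 + 0.3 * sin(pi * episode_number / 160) +
  rnorm(200, sd = 0.2)

# Fit a local regression, the kind of smoother geom_smooth() picks
# by default for fewer than 1000 points.
fit <- loess(mean_sentiment_score ~ episode_number)
trend <- predict(fit)

# Episode where the smoothed trend is highest.
peak_episode <- episode_number[which.max(trend)]
```

Reading the peak off the smoothed values rather than the raw bars keeps a single unusually cheerful episode from deciding the answer.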

Conclusion

In this article, you’ve learned that sentiment analysis scores words using a subjective dictionary. You’ve also seen how to use such information to get an overall feel of the show. We’ve put it all together to make an interactive plot with just a few lines of R code.

I have shown you that once you have an idea, nothing is impossible. Answering your own questions using R is easy. Be curious and do what you like! Even though the overall rating of the show keeps decreasing, I will not stop watching it. I still like it a lot!

Knowing R is a very valuable skill nowadays. You can start with the basics at Vertabelo Academy. I personally recommend learning the Tidyverse. I use it in every analysis and I can’t really imagine a woRld without it!

If you already know R and want to explore the data on your own, check out my GitHub repository. The page comes with instructions on how to do that. Good luck and enjoy your own analyses!